Complex Linear Models

Elizabeth King
Kevin Middleton

This Week

  • Sampling from data sets: complex designs
    • Complex Linear Models
    • Beyond Traditional Models
    • Parallel Processing Methods: Within R
    • Parallel Processing Methods: Rscript

Issues with Complex Models

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4\]

  • Appropriate test statistic
  • Accounting for relationships among variables

Energy expenditure in naked mole rats

Options for randomizing

  1. Randomize Y
    • Randomize Y to all existing combinations of predictors
    • Keeps relationships between predictors
    • Accounts for any issues with distribution of Y
  2. Restricted Randomization for each predictor
    • Only randomize one predictor at a time
    • Keeps effect of the other predictors accounted for
    • Isolates testing for the effect of the focal predictor
  3. Randomize Residuals
    • Replace observations with their corresponding residual
    • Allows the effects of individual factors and interactions to be tested after removing the effects of other factors and interactions

Simulation shows #1 is best for most practical applications

A single randomization

Randomize Y

MM.shuffle <- MM
MM.shuffle$Energy <- sample(MM$Energy)

fm4 <- lm(Energy ~ Mass + Caste, data = MM.shuffle)

MM.shuffle <- MM.shuffle %>% mutate(pred4 = predict(fm4))

ggplot(MM.shuffle, aes(x = Mass, y = Energy, color = Caste)) +
  geom_point(size = 4) +
  geom_line(aes(x = Mass, y = pred4, color = Caste), lwd = 2) +
  theme(legend.justification = c(0, 1), legend.position = c(0.05, 1)) +
  labs(x = "ln Body Mass (g)", y = "ln Daily Energy Expenditure (kJ)") +
  scale_color_viridis_d()

A single randomization

A single randomization

Randomize predictors together

MM.shuffle <- MM[sample(1:nrow(MM)),c("Caste","Mass")]
MM.shuffle$Energy <- MM$Energy

fm4 <- lm(Energy ~ Mass + Caste, data = MM.shuffle)

MM.shuffle <- MM.shuffle %>% mutate(pred4 = predict(fm4))

ggplot(MM.shuffle, aes(x = Mass, y = Energy, color = Caste)) +
  geom_point(size = 4) +
  geom_line(aes(x = Mass, y = pred4, color = Caste), lwd = 2) +
  theme(legend.justification = c(0, 1), legend.position = c(0.05, 1)) +
  labs(x = "ln Body Mass (g)", y = "ln Daily Energy Expenditure (kJ)") +
  scale_color_viridis_d()

A single randomization

Test statistics

fm4 <- lm(Energy ~ Mass + Caste, data = MM)
broom::tidy(summary(fm4))
# A tibble: 3 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)  -0.0969     0.942    -0.103 0.919    
2 Mass          0.893      0.193     4.63  0.0000589
3 CasteWorker   0.393      0.146     2.69  0.0112   
obs <- broom::tidy(summary(fm4))[2:3, c(1, 5)]

obs
# A tibble: 2 × 2
  term          p.value
  <chr>           <dbl>
1 Mass        0.0000589
2 CasteWorker 0.0112   

Perform Randomization

niter <- 1000
MM.shuffle <- MM
out.ps <- tibble("term" = rep(NA, niter * 2), 
                 "p.value" = rep(NA, niter * 2))
out.ps[1:2, ] <- obs
counter <- 3
for (ii in 2:niter) {
  
  MM.shuffle$Energy <- sample(MM$Energy)
  
  fm.s <- lm(Energy ~ Mass + Caste, data = MM.shuffle)
  
  out.ps[counter:(counter + 1), ] <- broom::tidy(summary(fm.s))[2:3, c(1, 5)]
  counter <- counter + 2
}

Visualize Randomizations

Empirical P-Values

out.ps |>
  filter(term == "Mass") |>
  summarize(mean(p.value <= obs$p.value[1]))
# A tibble: 1 × 1
  `mean(p.value <= obs$p.value[1])`
                              <dbl>
1                             0.001
out.ps |>
  filter(term == "CasteWorker") |>
  summarize(mean(p.value <= obs$p.value[2]))
# A tibble: 1 × 1
  `mean(p.value <= obs$p.value[2])`
                              <dbl>
1                             0.007

Multi-level Models & Exchangeability

Randomization assumes

  • IID
  • Exchangeability of observations under the null hypothesis
  • Grouping variables will often change the unit of exchangeability
    • e.g., paired t-test has a multiple groupings: a treatment category and a pair id category

See Anderson & Ter Braak 2003 on Canvas

Ethynylestradiol exposure in brown trout (Salmo trutta)

Marques de Cunha et al. (2019) split egg masses between a treatment exposed to ethynylestradoil (EE2) and one given a sham control (C_EE2).

Does EE2 exposure affect hatchling length?

  • Observations are not exchangable across sibling groups
  • Randomize treatment and control within sibling groups

Ethynylestradiol exposure in brown trout (Salmo trutta)

RR <- read_excel("Data/Embryos_EE2.xlsx") |>
  filter(Population.x == "Giesse") |>
  mutate(Length1_mm = as.numeric(Length1_mm)) |>
  drop_na()

mod <- lme(Length1_mm ~ Treatment, random = ~ 1 | Sibgroup, data = RR)

summary(mod)
Linear mixed-effects model fit by REML
  Data: RR 
       AIC      BIC    logLik
  240.9652 257.4901 -116.4826

Random effects:
 Formula: ~1 | Sibgroup
        (Intercept)  Residual
StdDev:    0.216796 0.2767693

Fixed effects:  Length1_mm ~ Treatment 
                 Value  Std.Error  DF  t-value p-value
(Intercept)  12.813950 0.03428380 404 373.7611  0.0000
TreatmentEE2 -0.050095 0.02595991 404  -1.9297  0.0543
 Correlation: 
             (Intr)
TreatmentEE2 -0.382

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-4.05917818 -0.59934229  0.01650223  0.63545570  2.68306361 

Number of Observations: 462
Number of Groups: 57 

Randomize within sib groups

sibg <- unique(RR$Sibgroup)

RR.s <- RR |> group_by(Sibgroup) |>
  mutate(Treatment.s = sample(Treatment))

#family 1
RR.s[RR.s$Sibgroup == sibg[1], c(1, 3, 4, 5)]
# A tibble: 8 × 4
# Groups:   Sibgroup [1]
  Sibgroup  Treatment Length1_mm Treatment.s
  <chr>     <chr>          <dbl> <chr>      
1 AFP - 171 C_EE2           13.1 EE2        
2 AFP - 171 C_EE2           12.9 EE2        
3 AFP - 171 C_EE2           12.8 C_EE2      
4 AFP - 171 C_EE2           12.4 C_EE2      
5 AFP - 171 EE2             12.2 EE2        
6 AFP - 171 EE2             12.4 C_EE2      
7 AFP - 171 EE2             12.6 C_EE2      
8 AFP - 171 EE2             13.1 EE2        
#family 2
RR.s[RR.s$Sibgroup == sibg[2], c(1, 3, 4, 5)]
# A tibble: 7 × 4
# Groups:   Sibgroup [1]
  Sibgroup  Treatment Length1_mm Treatment.s
  <chr>     <chr>          <dbl> <chr>      
1 AFP - 172 C_EE2           12.8 EE2        
2 AFP - 172 C_EE2           13.4 C_EE2      
3 AFP - 172 C_EE2           12.9 C_EE2      
4 AFP - 172 C_EE2           13.7 C_EE2      
5 AFP - 172 EE2             13.1 EE2        
6 AFP - 172 EE2             13.0 C_EE2      
7 AFP - 172 EE2             13.2 EE2        

Randomize within sib groups

  • Limited combinations within sib groups but many possible across the dataset
obs <- summary(mod)$tTable[2, 4]

niter <- 1000
output <- tibble("ts" = rep(NA, niter))
output$ts[1] <- obs
for (ii in 2:niter) {
  RR.s <- RR |> group_by(Sibgroup) |>
    mutate(Treatment.s = sample(Treatment))
  mod <- lme(Length1_mm ~ Treatment.s, random = ~ 1 | Sibgroup, data = RR.s)
  output$ts[ii] <- summary(mod)$tTable[2, 4]
}

Visualize Randomizations

General Considerations

  • Every randomization is testing against a particular null hypothesis
    • Define this hypothesis clearly
  • Randomization makes assumptions
  • Complex designs require thought about sampling strategy